Overview


Genome-wide association studies have successfully identified a growing number of 
common variants that robustly associate with a wide range of complex diseases and 
phenotypes. In the majority of cases though, the variants are predicted to have small to 
modest effect sizes, and, due to the technologies used, many of the signals discovered 
so far may not be the causal loci. As rare variation studies begin to explore the lower 
ranges of the allele frequency spectrum, using whole genome or whole exome 
sequencing to capture a larger proportion of variants, we expect to find variants with a 
more direct causal role in the phenotype(s) of interest. Interpreting possible functional 
mechanisms linking variants with phenotypes will become increasingly important. 
Experimental investigation is the most direct way to establish if a candidate vari- 
ant is causally involved in some phenotype, but it is a costly and time-consuming 
process, and so it is important to try to use as much existing relevant information as 
possible to prioritise variants for follow-up and to help formulate specific hypotheses 
about functional mechanisms to inform subsequent experiments. The genome is
complex and different classes of variants may have a wide range of, possibly tissue- 
specific, effects depending on their genomic context. In this chapter, we review some 
important classes of genome annotation and highlight some relevant computational 
tools and databases to help interpret and prioritise candidate variants depending on 
their genomic context. These resources may also play a role in the discovery of rare 
variant signals, as association techniques based on collapsing multiple rare variants 
together (reviewed in Chaps. 13 and 14) may use annotation of genes and regulatory 
elements to select biologically meaningful groups of variants, and other techniques 
can use prediction scores to upweight likely functional variants to increase statistical 
power. In this chapter, we focus on smaller-scale variants such as single nucleotide 
variants (SNVs) and short sequence insertions and deletions (indels), though some of 
the approaches we discuss may also be applied to larger structural variants. 


Mapping Variants to Annotated Features 


An obvious first step in trying to interpret possible functions of sequence variants is to 
identify overlapping genomic features that may be affected. Features of particular 
interest include protein-coding and non-coding genes, transcription factor binding 
sites and other potential regulatory regions. There is a wide range of resources and
databases that can be used to identify likely functional genomic features, from very 
specific resources on a single class of feature such as the miRanda databases of 
microRNA (miRNA) target sites (Betel et al. 2007) to broad collections of annotations 
such as the Ensembl (Flicek et al. 2012) and UCSC (Meyer et al. 2013) databases. 

For small numbers of variants, looking up the relevant loci in a genome browser, 
such as Ensembl or UCSC, is a convenient way to find overlapping or nearby fea- 
tures and to visualise variants in their genomic context. Both browsers contain a 
wealth of information on genes, regulatory regions and informative local genomic 
properties such as conservation, GC content and co-located or nearby variants (all 
of which we discuss in more detail later). For larger numbers of variants, automated 
approaches are clearly required. For simply identifying features overlapping vari- 
ants, software packages such as BEDTools (Quinlan and Hall 2010) and BEDOPS 
(Neph et al. 2012a) provide powerful and efficient tools for computing overlaps and 
proximity (among other useful metrics) between large numbers of genomic loci and 
can read common variant file formats such as VCF and GVF and annotation files in 
widely used formats such as BED, GFF, GTF and SAM (more details on these formats
are given in the Appendix). More variation-specific tools such as the Ensembl
Variant Effect Predictor (VEP) (McLaren et al. 2010) and ANNOVAR (Wang et al. 2010)
also identify a wide range of features overlapping variants, and can additionally make
more specific predictions depending on the affected feature.
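As a minimal sketch of the overlap step (not how BEDTools or BEDOPS are implemented), the Python below checks a single nucleotide variant against a small list of features. The `Feature` type, the function name and the example coordinates are hypothetical. It assumes BED-style intervals (0-based start, exclusive end) and a VCF-style 1-based variant position, so the position is shifted by one before comparison.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """A genomic interval using BED conventions (hypothetical helper type)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def overlapping_features(chrom: str, pos: int, features: List[Feature]) -> List[str]:
    """Return names of features that contain a 1-based SNV position."""
    zero_based = pos - 1  # convert the VCF-style position to BED-style
    return [
        f.name
        for f in features
        if f.chrom == chrom and f.start <= zero_based < f.end
    ]


# Toy annotation set; names and coordinates are invented for illustration.
features = [
    Feature("chr1", 100, 200, "exon_A"),
    Feature("chr1", 150, 300, "enhancer_B"),
    Feature("chr2", 100, 200, "exon_C"),
]

hits = overlapping_features("chr1", 200, features)
```

A linear scan like this is adequate for a handful of loci; for genome-scale data, the sorted or indexed interval structures used by dedicated tools are far more efficient.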

For many available annotations, especially those in non-coding regions, our under- 
standing of the importance of specific genomic sequences is still in its infancy, and all 
we can report is that the variant overlaps the relevant annotation. For several classes 
of feature, such as genes and transcription factor binding sites, we have a more 
detailed understanding of the importance of particular nucleotide sequences and so
can make reasonably specific predictions about the effect of an allele on the element, 
as we discuss below. Even when we cannot take this further step, these overlaps pro- 
vide some indication of the genomic context of the variant locus, and several studies, 
including the ENCODE consortium (The ENCODE Project Consortium 2012), have
found significant enrichments of trait-associated variants in less well-characterised
regions, such as DNase I hypersensitive sites, suggesting that these variants, or those
nearby, affect some as-yet uncharacterised functional elements. 


Variants Falling in Protein-Coding Genes 


Protein-coding genes are perhaps the best understood genomic features, and given 
that a variant falls somewhere in an annotated gene structure, there are a number of 
predictions that can be made about its possible effect on gene function, such as 
whether it is predicted to change the amino acid sequence of the encoded protein, 
introduce premature stop codons or affect mRNA splicing. There are several com- 
putational tools that are designed to make these predictions that work mainly by first 
identifying annotated genes overlapping the variants and then applying various bio- 
logically informed rules based on both the variant location and allele sequences. 
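The kind of rule such tools apply can be illustrated with a small sketch that classifies a single nucleotide change within a coding sequence. This is an assumption-laden toy, not the logic of the VEP or any other tool: it uses the standard genetic code, expects the coding sequence on the coding strand, ignores splicing and indels, and the function name `snv_consequence` is hypothetical. The returned strings are Sequence Ontology consequence terms.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snv_consequence(cds: str, cds_pos: int, alt: str) -> str:
    """Classify an SNV at a 1-based position within a coding sequence."""
    idx = cds_pos - 1
    offset = idx % 3                 # position of the change within its codon
    codon_start = idx - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous_variant"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense_variant"


# Toy coding sequence: ATG TGG AAA TAA (Met, Trp, Lys, stop).
cds = "ATGTGGAAATAA"
result = snv_consequence(cds, 6, "A")  # TGG -> TGA
```

In this toy sequence, changing the last base of the TGG codon to A creates a TGA stop codon, so `result` is `"stop_gained"`.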

The Ensembl VEP uses a set of standardised consequence terms defined in the 
sequence ontology (SO) (Eilbeck et al. 2005) to describe the predicted effect of a 
genetic variant. The use of a standardised term set is important as it allows comparison 
between the results of different annotation systems, and the ontology structure supports 
biologically informed grouping and querying of annotation results. The VEP also pro- 
vides a wide range of ancillary annotation such as cDNA and protein relative coordi- 
nates, predicted amino acid substitutions (AASs) and SIFT and PolyPhen predictions 
for missense variants (discussed below). Other tools such as ANNOVAR and VAT
(Habegger et al. 2012) work in a similar way but have different performance
characteristics and vary in the amount of ancillary information available.

Variants that are predicted to have the most severe effects on coding genes 
include those that introduce premature stop codons, disrupt essential mRNA splic- 
ing signals and indels that change the translational reading frame. These are col- 
lectively termed “loss of function” (LoF) variants and are typically expected to be 
highly deleterious as they have been implicated in a number of severe diseases 
(MacArthur et al. 2012). Stop codons introduced early in the transcript mean that 
the mRNA is likely to undergo a cell surveillance process known as “nonsense- 
mediated decay” (NMD) (Isken and Maquat 2007) where the aberrant mRNA is 
degraded to avoid the production of deleterious protein isoforms and so may effec- 
tively knock-down the affected transcript. However, stop codons towards the end of 
the transcript may escape this process and only truncate a few amino acid residues 
and therefore have minimal effect on protein function, so not all premature stop 
variants should be considered functionally equivalent. 

Frameshifting variants may lead to an entirely different translated sequence and 
substantial elongation or truncation of the protein product. As with premature stop 
codons, the position of the variant in the coding sequence will clearly affect the
severity of the variant. Hu and Ng (2012) present a new tool that aims to identify 
frameshift variants that are likely to be truly deleterious and find that variants that 
affect fewer and less conserved residues are more likely to be tolerated. Hu and Ng 
(2012) also find that proximal frameshift variants are frequently compensatory in 
that a nearby downstream variant restores the reading frame disrupted by an 
upstream variant, highlighting the importance of considering the haplotype back- 
ground of a variant. 
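The idea of compensatory frameshifts can be made concrete with a short check. The sketch below is not the method of Hu and Ng (2012); it simply assumes that a set of indels lies on the same haplotype within one coding sequence, and asks whether their net length change is a multiple of three, in which case the reading frame is restored downstream of the last variant. The function name is hypothetical.

```python
from typing import Iterable


def restores_reading_frame(length_changes: Iterable[int]) -> bool:
    """Return True if indels on one haplotype leave the downstream frame intact.

    Each value is the length change of one indel in nucleotides:
    positive for insertions, negative for deletions.
    """
    return sum(length_changes) % 3 == 0


# A 1 bp deletion followed by a nearby 1 bp insertion on the same haplotype.
compensated = restores_reading_frame([-1, +1])
# A single 1 bp deletion with no compensating variant.
uncompensated = restores_reading_frame([-1])
```

Even when the frame is restored, the residues between the two variants are still altered, so a compensated pair is not necessarily benign.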

Variants that disrupt the essential two nucleotide donor and acceptor splice sites 
at either end of introns are also typically expected to severely disrupt the protein 
product. While these essential positions are indeed highly conserved, there is also 
substantial sequence conservation in the flanking nucleotides and in the branch site 
towards the 3’ end of the intron, so variants in these regions may also affect accurate 
splicing (indeed, this is one way in which “synonymous” variants in coding sequence 
might still have functional effects). Desmet et al. (2009) introduce a tool called the 
Human Splicing Finder which uses position weight matrices to predict the effect of 
different alleles on splicing motifs in all these relevant regions. 

It is important to note that despite the expected severity of loss of function vari- 
ants, there are still a substantial number of common LoF variants in human popula- 
tions, and each individual is predicted to carry up to 20 such variants in a homozygous 
state (MacArthur et al. 2012). This observation implies that we should be cautious 
about the interpretation of LoF variants without further phenotypic evidence. 
MacArthur et al. (2012) use their extensive survey of LoF variants found in the 1000 
Genomes Project data to develop a classifier that can identify genes that are likely 
to be tolerant of LoF variants based on conservation and protein network informa- 
tion, and so this approach may be used to filter LoF variants to identify those more 
likely to have some phenotypic effect. 

Other forms of coding variant that have been the subject of substantial research 
are missense variants predicted to result in a single AAS; these are an interesting 
class of variant as it appears that some AASs do not have any noticeable effect on 
protein function and the underlying variants are common in human populations, 
while others have been implicated in a wide range of diseases—around half of the 
mutations implicated in human disease from the Human Gene Mutation Database 
(HGMD) are classified as missense (Stenson et al. 2009). Several computational 
techniques have been developed to try to discriminate damaging AASs from appar- 
ently benign variants. These approaches can be divided into two main classes: those 
that make predictions based on some biologically informed assumptions about prop- 
erties of important residues and those that are trained by machine learning methods 
to discriminate between benign and damaging substitutions. A widely used example 
of the first class is an algorithm called SIFT (Ng and Henikoff 2001) which makes 
predictions based entirely on a protein multiple sequence alignment (MSA) by look- 
ing for evidence that a substitution at a specific residue might be tolerated because, 
for example, the mutant residue (or one with similar physico-chemical properties) is 
found at that position in a related protein from another species, or conversely if a 
substitution is likely to be damaging because the affected residue is highly con- 
served. A popular example of the second class of approaches is PolyPhen-2 
(Adzhubei et al. 2010) which uses a set of missense variants annotated in the UniProt
database (UniProt Consortium 2011) as involved in human disease and trains a naïve 
Bayes classifier to discriminate between these damaging variants and a control set of 
common, polymorphic variants. PolyPhen uses a set of 12 predictive features for 
each variant, including a similar conservation metric from an MSA as used by SIFT, 
three-dimensional structural data, whether the residue is in a transmembrane region 
or a protein domain inter alia. There are also a number of other tools that take simi- 
lar approaches but use different sets of annotations. Thusberg et al. (2011) provide a 
recent review and performance comparison of several AAS prediction tools, and Liu 
et al. (2011) present a database called dbNSFP which contains precomputed predic- 
tions from four tools for all possible AASs in the human genome. 

Given the wide variety of these AAS effect prediction tools, a few methods have 
recently been proposed that combine predictions from a number of different tools to 
try to improve performance over any single technique. One of the first such methods 
is known as Condel (Gonzalez-Pérez and Lopez-Bigas 2011) and integrates scores 
from five different predictors using a weighted average which the authors show 
gives a substantial improvement in sensitivity and specificity on some test sets. 
CAROL (Lopes et al. 2012) integrates predictions from SIFT and PolyPhen using a 
weighted Z-method, and the authors find that this method can outperform Condel on 
their test set. There are plug-in modules available for the Ensembl VEP to compute 
both Condel and CAROL scores for missense variants. 
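To illustrate the general idea of combining predictors, the sketch below computes a plain weighted average of scores. It is not the Condel or CAROL algorithm: the weights are hypothetical placeholders, whereas the published methods derive their weighting from the data. The only tool-specific assumption is the direction of the scores: low SIFT scores indicate damaging substitutions while high PolyPhen-2 scores do, so the SIFT score is inverted to put both on a common scale.

```python
from typing import Dict


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-tool scores that share a common direction."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Example values invented for illustration.
sift_score = 0.02       # low SIFT score suggests a damaging substitution
polyphen_score = 0.95   # high PolyPhen-2 score suggests a damaging substitution

scores = {
    "sift": 1.0 - sift_score,   # invert so that higher means more damaging
    "polyphen": polyphen_score,
}
weights = {"sift": 1.0, "polyphen": 1.0}  # hypothetical equal weights

consensus = combined_score(scores, weights)
```

With equal weights this reduces to a simple mean; the value of a consensus method lies in choosing weights that reflect how reliable each tool is for a given score.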

Proteins are typically composed of one or more functional domains, and when 
considering the effect of any coding variant, it is also useful to check if it might 
disrupt any important protein domains. There are a number of databases of well- 
characterised protein domains, such as Pfam (Punta et al. 2011) and InterPro 
(Hunter et al. 2012), and Ensembl (among other resources) provides a mapping of 
these domains to gene annotations. 

Variation in other gene regions, such as introns and the 5’ and 3’ untranslated 
regions (UTRs), is typically currently annotated by tools such as the Ensembl VEP 
and ANNOVAR simply as an overlap. However, these regions are known to contain 
important signals for gene regulation and may also affect mRNA structural stability. 
Regulatory features in the UTRs include miRNA target sites found in the 3’ UTRs 
of many genes. These short sequences are bound by specific miRNAs which typically
serve to suppress translation of the mRNA and act as a form of post-transcriptional
gene regulation. The miRanda algorithm for miRNA target prediction (John
et al. 2004) can be used to identify variants that disrupt likely target sites and may 
also be applied to identify variants that introduce novel target sites. As well as 
important sequence signals for mRNA splicing, intronic regions may also contain 
many of the regulatory elements discussed later, such as transcription factor binding 
sites and enhancers. 

An important consideration when interpreting all forms of genetic variants is that 
many human genes are subject to alternative splicing and may give rise to a number 
of possible transcripts, frequently depending on tissue or developmental stage. 
A single variant may therefore be predicted to have a number of different effects 
depending on which transcripts it falls in—an apparently highly deleterious 
premature stop codon may have little consequence if it is found in an exon that is 
rarely included in any transcript. Rich and detailed annotation of alternatively
spliced transcripts is therefore very important for accurate variant interpretation, 
and the GENCODE gene set (Harrow et al. 2012) represents perhaps the most 
detailed set of manually annotated transcript models available for human. 

Even if a variant is predicted to affect an important transcript, even severely
deleterious variants may be tolerated as long as they are in a heterozygous state
and so only disrupt one copy of the gene. For some genes (termed haploinsufficient),
however, a single functional copy is not adequate to maintain function (Huang et al.
2010). Huang et al. develop a model that predicts which genes are likely to be
haploinsufficient from a number of gene-level annotations. Its predictions can be
used to further prioritise variants, and the work highlights the importance of
considering variant annotations at the organismal level.


Variants in Non-coding Genes 


There is increasing interest in transcribed regions of the genome that do not give rise 
to protein-coding mRNAs, and a number of different classes of non-coding RNA 
genes have now been identified and are extensively annotated in the GENCODE 
resource. There has been less work on interpreting the possible effects of variants in 
non-coding genes, but some of the approaches described above, such as annotation 
of variants affecting splicing, may also be applied to these. 

The function of many RNA genes depends on the secondary structures formed 
after the RNA has been transcribed from genomic sequence. Intra-strand base pair- 
ing is an important factor in determining this structure, and sequence variants that 
disrupt base complementarity may thus affect the function of RNA genes. The 
RNAsnp server (Sabarinathan et al. 2013) uses RNA structure prediction algorithms 
from the Vienna package (Hofacker 2003) to predict the possible effect of variants 
on RNA secondary structure. 

Some specific classes of RNA genes have other well-characterised functional 
sequence regions. As discussed above, miRNAs serve an important role in gene 
regulation, and they do so by binding specific sequences in the UTRs according to 
base pair interactions. Sequence variants in the binding regions of mature miRNA 
transcripts may therefore have potentially complex downstream effects on regula- 
tory networks. 


Intergenic and Regulatory Variants 


Genic regions remain the most well-characterised regions of the genome, but
recent large-scale efforts such as the ENCODE and the NIH Roadmap Epigenomics
projects have made available substantial amounts of information about biochemical
activity in the ~98% of the genome that does not encode protein. These data are
varied in format and range from specific annotations identifying regions of the 
genome bound by transcription factors (TFs) to broad epigenetic marks such as histone
modifications and long-range chromatin interactions. Given that the majority of
trait-associated variants, 88% according to a recent survey (Hindorff et al. 2009), do
not map to protein-coding loci, the availability of these data provides a promising
opportunity to interpret the large numbers of non-genic variants. It is not, however,
currently clear to what extent genetic variation in many of the regions identified in 
these projects might have phenotypic effects. 

Perhaps the most readily interpretable regulatory annotations are TF binding sites. 
Many TFs bind specific sequence motifs in the genome, and so variants that result in 
changes in these motifs, particularly at high-information content positions within the 
motif, might have a direct effect on the binding affinity of the relevant proteins. 
However, Maurano et al. (2012) find that variants at high-information content, con- 
served residues of the CTCF TF motifs aligned under regions with experimental evi- 
dence of CTCF binding, had no effect on binding intensity, implying there is substantial 
contextual buffering of variants in TF motifs, and it appears our understanding of the 
importance of specific sequence variants in these regions is still limited. 

As with transcript splicing signals, TF motifs are typically represented as position 
weight matrices, and so the effect of a variant allele on an aligned motif can be cal- 
culated straightforwardly as the difference in alignment score between the two 
alleles. However, TF motifs are typically short—on the order of 10-20 nucleotides in 
length—and are found in numerous locations throughout the genome, and so most 
instances of motifs are unlikely to be functionally important (Pique-Regi et al. 2011). 
It is therefore important to consider further contextual evidence, such as protein-DNA
interaction data for the TF of interest, in order to increase prediction accuracy.
ChIP-seq data for over 100 TFs in dozens of cell lines and tissues is available from 
ENCODE and Roadmap Epigenomics projects. The JASPAR database provides the 
largest open access database of TF motifs, and software such as MOODS (Korhonen 
et al. 2009) and the MEME suite (Bailey et al. 2009) can be used to align these motifs 
to sequence of interest and to check the effect of sequence variants. The Ensembl 
VEP identifies variants that overlap TF motifs lying in matched ChIP-seq peaks and 
identifies if the variant allele increases or decreases the match to the motif consensus 
sequence and if the variant lies in a high-information position within the motif. 
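The score difference between two alleles on an aligned motif can be sketched as follows. This is an illustrative calculation rather than the procedure used by MOODS, MEME or the VEP: it assumes a motif given as per-position base counts, a uniform background of 0.25 per base, and a pseudocount of one, and all function names and the toy motif are hypothetical.

```python
import math
from typing import Dict, List

PWM = List[Dict[str, float]]


def log_odds_pwm(counts: List[Dict[str, int]],
                 pseudocount: float = 1.0,
                 background: float = 0.25) -> PWM:
    """Convert per-position base counts into log2 odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2((column.get(b, 0) + pseudocount) / total / background)
            for b in "ACGT"
        })
    return pwm


def motif_score(pwm: PWM, seq: str) -> float:
    """Sum the log odds of each base of a sequence aligned to the motif."""
    return sum(column[base] for column, base in zip(pwm, seq))


def allele_score_difference(pwm: PWM, ref_seq: str, offset: int, alt: str) -> float:
    """Score of the alternate allele minus score of the reference allele."""
    alt_seq = ref_seq[:offset] + alt + ref_seq[offset + 1:]
    return motif_score(pwm, alt_seq) - motif_score(pwm, ref_seq)


# Toy three-position motif: position 0 strongly prefers A, position 1 is
# uninformative, position 2 strongly prefers G.
counts = [
    {"A": 8},
    {"A": 2, "C": 2, "G": 2, "T": 2},
    {"G": 8},
]
pwm = log_odds_pwm(counts)

informative_change = allele_score_difference(pwm, "ATG", 0, "C")
uninformative_change = allele_score_difference(pwm, "ATG", 1, "A")
```

In this toy motif the change at the high-information first position lowers the score substantially, while the change at the uninformative second position leaves it unchanged, which mirrors the distinction drawn in the text.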

Active regulatory regions are often recognisable by an accessible chromatin 
environment, and so assays which identify regions of open chromatin, such as 
DNase I hypersensitivity and FAIRE (formaldehyde-assisted identification of regulatory
elements), can help identify regulatory elements. DNase I footprinting (Neph
et al. 2012b) can identify specific genomic regions that are likely bound by proteins 
even when the specific factor cannot be identified and so provide a more specific 
prediction of a functionally important region. Data from both assays are again avail- 
able in a wide range of tissues and cell lines. The potential role of variants in estab- 
lishing accessible chromatin is still not well understood, but Degner et al. (2012) 
find thousands of variants with significant association with differential chromatin 
accessibility and argue that variants in these regions may make an important contri- 
bution to phenotypic variation. 

64 G.R.S. Ritchie and P. Flicek


Other available data include epigenetic marks such as DNA methylation and
various histone modifications that mark actively transcribed or repressed genomic
regions and which are associated with regulatory elements such as enhancers and
promoters. Two recent software packages, ChromHMM (Ernst and Kellis 2012) 
and Segway (Hoffman et al. 2012), integrate open chromatin and histone modification 
data to segment the entire genome into distinct functional regions. They find that 
these methods identify biologically important regions such as transcription start 
sites and enhancers. Annotations from these tools may be used to identify the likely 
functional context of non-coding variants, though we have relatively little under- 
standing of the effect of sequence variation on the elements discovered, and because 
these tools do not take the sequence into account, it is not possible to compare dif- 
ferent predictions for different alleles. 

Data from the various techniques discussed here are typically made available in 
BED (or similar) format (see the Appendix for a description of this file format), and 
so variants can be annotated as overlapping or lying near these elements as described 
earlier. There are also Web resources available to identify occupied annotations 
given variant identifiers or coordinates. RegulomeDB (Boyle et al. 2012) finds over- 
laps with a wide range of data from the ENCODE project and TF motif alignments 
and then assigns a rule-based score based on the consistency and specificity of avail- 
able annotations. HaploReg (Ward and Kellis 2012b) similarly finds overlaps with 
non-coding annotations but also provides information about linked variants and 
their associated annotations. 


Conservation and Constraint 


Genomic regions conserved by natural selection over evolutionary time are likely to 
be functionally important. By comparing the human sequence to that of other pri- 
mate and mammalian genomes, we can identify regions and even specific nucleo- 
tides that appear to be under constraint. Conservation metrics derived from these 
sequence alignments provide a powerful means to identify potentially functional 
sequence features even in the absence of further evidence and can be used to iden- 
tify and prioritise potentially important variant loci, even within annotation catego- 
ries. Indeed, several of the quantitative approaches we discussed above make 
extensive use of conservation information, either at the DNA or protein sequence 
levels, to derive their scores. 

There are several methods that can provide nucleotide resolution conservation 
scores (important for annotating SNVs), including GERP (Davydov et al. 2010) and 
phyloP (Siepel et al. 2006), which are based on different algorithmic approaches, 
but which both use multiple sequence alignments to identify genomic regions with 
less variation than would be expected under some background model. Nucleotide 
level conservation scores can also be used to identify runs of especially constrained 
sequence, which may correspond to functional elements, and these regions can also 
be used as an informative regional annotation. 
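Identifying runs of constrained sequence from per-nucleotide scores can be sketched as a simple thresholding pass. The threshold and minimum length below are arbitrary illustrative values, not those used to define GERP or phyloP elements, and the function name is hypothetical. Runs are returned as 0-based, half-open index pairs.

```python
from typing import List, Sequence, Tuple


def constrained_runs(scores: Sequence[float],
                     threshold: float,
                     min_length: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of runs with scores >= threshold."""
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        runs.append((start, len(scores)))
    return runs


# Invented per-nucleotide conservation scores.
scores = [0.1, 2.5, 3.0, 2.2, 0.0, 2.4, 0.3, 2.1, 2.9, 3.3]
runs = constrained_runs(scores, threshold=2.0, min_length=3)
```

Here `runs` is `[(1, 4), (7, 10)]`: the isolated high score at index 5 is discarded because it does not reach the minimum run length.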

Conservation has proven to be an important signal in coding regions, but many 
regulatory elements appear to have a much faster evolutionary rate, and there is fre- 
quently little detectable evolutionary conservation, for example, Schmidt et al. (2010) 
find that most binding events for the two transcription factors they study are species
specific even among vertebrates. The recent availability of allele frequency data
across the genome from projects such as the 1000 Genomes Project (The 1000 Genomes
Project Consortium 2012) offers an alternative approach to estimating constraint
on sequence features at potentially shorter timescales than possible using
interspecies comparison. Ward and Kellis (2012a) use several metrics of sequence 
diversity such as variant density, heterozygosity and derived allele frequency 
computed from the 1000 Genomes Project data to demonstrate that a wide range of 
non-coding elements demonstrate detectable levels of constraint in human popula- 
tions. These measures can potentially be used to prioritise variants according to the 
constraint of overlapping annotations. 
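The diversity metrics mentioned above can be sketched for a single annotated region as follows. This is a simplified illustration, not the exact definitions used by Ward and Kellis (2012a): it assumes biallelic variants with known derived allele frequencies, uses the expected heterozygosity 2p(1 - p) for each variant, and averages per variant rather than per base. The function names and example frequencies are hypothetical.

```python
from typing import Dict, Sequence


def expected_heterozygosity(allele_frequency: float) -> float:
    """Expected heterozygosity 2p(1 - p) at a biallelic site."""
    return 2.0 * allele_frequency * (1.0 - allele_frequency)


def summarise_diversity(derived_afs: Sequence[float],
                        region_length: int) -> Dict[str, float]:
    """Summarise variation within one region of the given length in bases."""
    n = len(derived_afs)
    if n == 0:
        return {"variant_density": 0.0, "mean_heterozygosity": 0.0, "mean_daf": 0.0}
    return {
        "variant_density": n / region_length,
        "mean_heterozygosity": sum(expected_heterozygosity(p) for p in derived_afs) / n,
        "mean_daf": sum(derived_afs) / n,
    }


# Two invented variants in a 100 bp element.
summary = summarise_diversity([0.5, 0.1], region_length=100)
```

Lower values of these metrics in one class of element relative to matched background regions would be consistent with purifying selection acting on that class.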


Integrative Approaches 


Recently, two complementary techniques have been released that integrate a wide 
variety of the classes of data discussed above with the aim of prioritising candidate 
functional variants. GWAVA (Ritchie et al. 2014) aims to identify likely functional
regulatory variants and consists of a classifier trained to discriminate annotated
regulatory variants involved in human disease, taken from the HGMD, from several
different sets of control variants from the 1000 Genomes Project. Features used to
differentiate between these classes of variants include genic context, regulatory
annotations, conservation and measures of variation in human populations. The
authors demonstrate that the method can identify likely functional variants in a
number of contexts relevant to human genetics studies.
CADD (Kircher et al. 2014) is also an integrative approach that includes several of 
the same annotations used in GWAVA, but is also applicable to variants in coding 
regions as it incorporates transcript-level annotations from the Ensembl VEP and 
predictions from SIFT and PolyPhen (described earlier). Instead of training on 
known disease-implicated variants, CADD is trained to discriminate variants that
have become fixed in the human lineage, which presumably represent tolerable
variation, from simulated variants unobserved in human populations. This
approach is appealing as it can assign a single score to variants falling in any class 
of genomic element and supports a systematic approach to ranking and prioritising 
variants across the genome. 


Overlap with Known Variants and Associated Loci 


While the majority of variants discovered so far in the human genome have not been 
characterised, an obvious aid to the interpretation of some candidate variant is to 
check for co-located or nearby variants with some established phenotypic association. 
These data may take a range of forms, from statistical association with a complex 
phenotype such as a GWAS signal to empirical evidence that the variant results in
increased expression of some particular gene. Locus-level phenotypic annotation, 
such as the effect of a gene knockout in a model organism, can also provide useful 
insight into the possible functional role of a genic or regulatory variant.

There are a number of useful databases that can be consulted to find known phe- 
notype associations; these can typically be queried either by the variant locus or 
phenotype of interest. The HGMD (Stenson et al. 2009) aims to collect variants that 
are “responsible for human inherited disease” and contains thousands of variants 
curated from the literature that have been implicated in a wide range of human dis- 
eases, though with a bias towards monogenic disorders. The Online Mendelian 
Inheritance in Man (OMIM) resource also includes detailed characterisation of 
human genes and associated phenotypes and includes some related genetic variants. 
The NHGRI GWAS catalogue (Hindorff et al. 2009) collects information from 
GWAS studies and identifies both specific variants and nearby loci associated with 
the relevant phenotypes. 

Even in the absence of any phenotypic data, it is useful to establish whether a
candidate variant is novel or has been discovered before, and to find allele frequency
information in different populations. A rare variant in one population may be common
elsewhere in the world, and as discussed above, allele frequency can be informative
about functional constraint. Data from large variant discovery studies such as the 
HapMap, 1000 Genomes and NHLBI Exome Sequencing Projects can be used to 
find allele frequencies for several populations around the world. These data are also 
collated centrally in the Ensembl and dbSNP databases, among other resources. 


Summary 


Next-generation association studies using sequencing technologies are already 
exploring the phenotypic consequences of novel variants at lower allele frequencies 
than previously feasible, and we expect to find variants with direct effects on phe- 
notypic variation. The various resources we have reviewed here can of course be 
used after an association analysis has been performed to identify candidate func- 
tional variants among those linked to the association signals and to inform hypoth- 
eses for experimental validation. However, by identifying variants a priori more 
likely to play a functional role in the trait of interest, annotations may also be used 
to increase power to discover loci in the first place. This might be especially fruitful 
for rare variant studies where the sample sizes needed to reliably detect associations 
using single locus tests are still prohibitive. In a recent study, Schork et al. (2013) 
find that trait-associated variants are substantially enriched in various functional 
categories and that annotations can help identify associations that are more likely to 
replicate in independent samples. We anticipate that careful incorporation of anno- 
tation resources into future association studies will yield substantial insights into the 
contribution of rare variants to human phenotypes. 


Appendix


Relevant variant and annotation file formats: 


GFF (General Feature Format): A line-oriented, tab-delimited text file format for 
describing the location of genomic features. GFF was originally designed to rep- 
resent gene models but is now used for a wide range of genomic features. The 
format requires the following eight columns on each line: sequence name, fea- 
ture source, feature name, start coordinate, end coordinate, score, strand and 
frame. The ninth column can contain any number of attributes represented as 
tag-value pairs separated by semicolons. 


— http://www.sequenceontology.org/gff3.shtml 
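A minimal sketch of reading one GFF record is shown below. It assumes GFF version 3, where attributes in the ninth column are written as tag=value pairs (GFF version 2 and GTF use a different attribute syntax), and it does no validation or unescaping. The function name and the example record are hypothetical. GFF coordinates are 1-based and inclusive.

```python
from typing import Any, Dict


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited GFF3 record into named fields."""
    columns = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame = columns[:8]
    attributes = {}
    if len(columns) > 8:
        for pair in columns[8].split(";"):
            pair = pair.strip()
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            attributes[tag] = value
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Invented example record.
record = parse_gff3_line("chr1\tensembl\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=tx1")
```

For real data sets, an established parser is preferable, since this sketch ignores comment lines, directives and escaped characters.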


BED (Browser Extensible Data): BED is also a general format for describing 
genomic features and again is a line-oriented text file which uses whitespace to 
delimit data columns. Only three columns are required for a valid BED file: the 
chromosome (or scaffold) name, the start coordinate and the end coordinate. 
There are nine further optional fields to include further information such as the 
name of the feature, associated scores and various display configurations that 
define how the data is represented in a genome browser. Large BED files can be 
converted to an efficient binary format known as bigBed. 


— https://genome.ucsc.edu/FAQ/FAQformat.html 


GTF (General Transfer Format): Originally a version of GFF specialised for rep- 
resenting gene models, GTF is now identical to GFF version 2. 

VCF (Variant Call Format): A text file format designed to represent sequence 
variants (SNVs, indels and structural variants) called against a reference 
sequence, with a line representing each individual variant. Required tab-delimited 
columns define the position and alleles of the variant, and further columns can 
include genotypes, quality scores and QC filters. VCF also supports the inclusion 
of arbitrary metadata, such as functional annotations for variants, in the INFO 
column (often identified with a “CSQ” tag). 


— http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
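The fixed columns of a VCF record can be read with a few lines of Python, sketched below. It handles only the first eight tab-delimited columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), skips header lines and genotype columns entirely, and keeps INFO values as strings. The function name and the example record are hypothetical.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one VCF data line."""
    chrom, pos, variant_id, ref, alt, qual, filt, info = (
        line.rstrip("\n").split("\t")[:8]
    )
    info_fields: Dict[str, Any] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flag-style entries have no "=" and are stored as True.
        info_fields[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),         # 1-based position
        "id": variant_id,
        "ref": ref,
        "alts": alt.split(","),  # multiple alternate alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": info_fields,
    }


# Invented example record with two alternate alleles.
variant = parse_vcf_line("1\t12345\trs1\tA\tG,T\t50\tPASS\tAF=0.01;DB")
```

Annotation tools that write into the INFO column add further key=value entries of this kind, which can be picked out of `info` by their tag.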


GVF (Genome Variation Format): A version of GFF (version 3) specialised for 
representing genomic variants. The same columns as required for GFF are also 
required, but there are also a number of required attributes in the ninth column 
to include variant identifiers and allele sequences, etc. Optional attributes are 
also available which can represent functional annotations such as genetic 
consequences. 


— http://www.sequenceontology.org/resources/gvf.html 


SAM (Sequence Alignment/Map Format): A tab-delimited text format for repre- 
senting sequence reads aligned against some reference sequence (typically a 
reference genome assembly). Each line represents the alignment of a single read 
and has 11 mandatory fields that include details of the alignment sequence, position,
quality and a compact representation of the alignment itself in CIGAR
format. There is also an efficient binary version of SAM known as BAM. The 
SAMtools package can be used to convert between SAM and BAM formats. 


— http://samtools.sourceforge.net/ 


WIG (Wiggle Track Format): WIG format is used to represent quantitative data
across a reference sequence such as conservation scores, GC percentage, etc. 
It is again a line-oriented format with the value corresponding to each reference 
position represented on a separate line. Data can be represented with either fixed 
or variable steps between each data point. Large WIG files can be converted to an 
efficient indexed binary format called bigWig.


— https://genome.ucsc.edu/FAQ/FAQformat.html 